Statika: managing cloud resources, bioinformatics tools and data

Authors

  • Alexey Alekhin
  • Evdokim Kovach
  • Pablo Pareja-Tobes
  • Marina Manrique
  • Eduardo Pareja
  • Raquel Tobes
  • Eduardo Pareja-Tobes

Abstract

Next Generation Sequencing (NGS) has brought a revolution to the bioinformatics landscape, definitely reshaping fields such as genomics and transcriptomics by offering vast amounts of data about previously inaccessible domains in a cheap and scalable way. Thus biological data analysis demands, more than ever, high-performance computing architectures. In particular, Cloud Computing, a comparable breakthrough in the IT world, holds promise as the foundation on which a solution could be built (as already demonstrated by pioneering efforts such as Galaxy or CloudBioLinux). It provides a well-suited framework for high-throughput data analysis: deploying architectures with as much computing capacity as needed, scaling horizontally, scaling down in real time as computing needs change, and the pay-as-you-go model all make for a strong case.

However, fast, reproducible, and cost-effective data analysis in the cloud at such scale remains elusive. One fundamental prerequisite for achieving it is the ability to manage both the tools and the data to be used in a robust, reproducible, and automated way. High-throughput analysis, where a lot of resources are used and paid for, needs a robust configuration system to rely on. In the cloud computing world, due to its on-demand nature, automated resource configuration is a critical factor. This is even more so in bioinformatics analysis, where an intricate and unstable chain of dependencies often underlies tools and data; knowing beforehand that all the resources to be used are properly configured is invaluable.

Statika (http://ohnosequences.com/statika) aims to be a basic tool for the declaration and deployment of composable, versioned and reproducible cloud infrastructures for the bioinformatics space. Data, tools and infrastructure are treated on an equal footing, and an expressive domain-specific language allows the user to express complex dependency relationships, check for possible version conflicts and automatically choose a safe resource creation order. By making use of advanced features of the Scala programming language, such as dependent types and type-level computations, a great deal of structure can be expressed abstractly and checked at compile time, before any cost is incurred. A strong versioning system in which both data and tools are included makes reproducibility not only possible but actually enforced.

Statika has been put to work in scenarios as different as Nispero, a cloud-based system for scaling inherently parallel computations in the bioinformatics domain, and versioned, modular, automated deployments of Bio4j, a graph database integrating all data from key resources in the bioinformatics data space, including UniProt, Gene Ontology, the NCBI Taxonomy and UniRef. We use it internally for the integration and automated deployment of all sorts of bioinformatics tools and data. Statika is open source, available under the AGPLv3 license.

This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).

Proceedings IWBBIO 2014. Granada, 7-9 April 2014, pp. 1412-1413.
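
The abstract's idea of checking for version conflicts and choosing a safe resource creation order can be illustrated with a small sketch. This is not Statika's API: Statika performs these checks at compile time in Scala, whereas the Python below does them at run time, purely to show the underlying logic. The bundle names, version strings, and the functions `version_conflicts` and `creation_order` are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical bundles: name -> (declared version, {dependency: required version}).
# Names and versions are made up for illustration only.
BUNDLES = {
    "jdk": ("7", {}),
    "bio4j-data": ("0.9", {}),
    "bio4j": ("0.9", {"jdk": "7", "bio4j-data": "0.9"}),
    "analysis": ("1.0", {"bio4j": "0.9", "jdk": "7"}),
}


def version_conflicts(bundles):
    """Return (bundle, dependency, required, declared) for every mismatch."""
    conflicts = []
    for name, (_, deps) in bundles.items():
        for dep, required in deps.items():
            # A dependency that is not declared at all counts as a conflict.
            declared = bundles[dep][0] if dep in bundles else None
            if declared != required:
                conflicts.append((name, dep, required, declared))
    return conflicts


def creation_order(bundles):
    """Return an order in which every bundle comes after its dependencies."""
    conflicts = version_conflicts(bundles)
    if conflicts:
        raise ValueError(f"version conflicts: {conflicts}")
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {name: set(deps) for name, (_, deps) in bundles.items()}
    return list(TopologicalSorter(graph).static_order())
```

Calling `creation_order(BUNDLES)` yields a list in which `jdk` and `bio4j-data` precede `bio4j`, which in turn precedes `analysis`. A mismatched requirement raises `ValueError` before any resource would be created, which loosely mirrors the "checked before any cost is incurred" property described above.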

Similar resources

Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming

The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...

Fuzzy retrieval of encrypted data by multi-purpose data-structures

The growing amount of information that has arisen from emerging technologies has caused organizations to face challenges in maintaining and managing their information. Expanding hardware, human resources, outsourcing data management, and maintenance an external organization in the form of cloud storage services, are two common approaches to overcome these challenges; The first approach costs of...

Multilevel parallelism in sequence alignment using a streaming approach

Ultrascale computing and bioinformatics are two rapidly growing fields with a big impact right now and even more so in the future. The introduction of next generation sequencing pushes current bioinformatics tools and workflows to their limits in terms of performance. This forces the tools to become increasingly performant to keep up with the growing speed at which sequencing data is created. U...

Genomics Virtual Laboratory: A Practical Bioinformatics Workbench for the Cloud

BACKGROUND Analyzing high throughput genomics data is a complex and compute intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visuali...

GeMSTONE: orchestrated prioritization of human germline mutations in the cloud

Integrative analysis of whole-genome/exome-sequencing data has been challenging, especially for the non-programming research community, as it requires simultaneously managing a large number of computational tools. Even computational biologists find it unexpectedly difficult to reproduce results from others or optimize their strategies in an end-to-end workflow. We introduce Germline Mutation Sc...

Publication date: 2014